Overview

Joining data is one of the essential skills in data analysis. The concept of joining data is straightforward yet can be confusing at first or while transitioning from excel to python/SQL/R. This vignette aims to provide a walkthrough with an intuitive explanation of joining data in R and how to do it. One of the most used R package dplyr is used to perform these operations.

This vignette will first go through the definition of data join,and different types of join.

Note

If dplyr package is not available in the environment, simple steps including installing and importing the package(only importing if installed already) can solve the problem.

install.packages("dplyr")
library(dplyr)


The Definition

At times, data can be scattered in different dataset. It is essential to combine the datasets to perform the expected analysis.

The data join operation is when two or more datasets are combined where at least one column is similar in each dataset.

Here, the join operations will be illustrated using two very simple and short dataset.

show(first_table)
#> # A tibble: 3 × 3
#>   A     B     C_first
#>   <chr> <chr> <chr>  
#> 1 A2    B2    C2     
#> 2 A3    B3    C3     
#> 3 A4    B4    C4
show(second_table)
#>    B C_second  D
#> 1 B1       C1 D1
#> 2 B2       C2 D2
#> 3 B4       C4 D4
#> 4 B5       C5 D5


Types of Join

The primary purpose of joining two dataset is to merge records from more than one source. There are two types of joins.

  1. Mutating Join
  2. Filtering Join


Mutating Join

A Mutating Join combines data from two or more datasets based on the matched observation in both datasets. It adds new variables to a dataset from matching cases from another dataset. There are four types of mutating verbs in dplyr package.


I. Inner Join

The inner join keyword is used to select all the common rows between columns. If there are records in first table that do not match with second table, the records will be discarded from result dataset.

Inner Join

# Performing Inner Join
# Here, both the dataset have common variable, that is: C_first in first_table and C_second in second_table
# in the "by" part, the column names must follow order. For exapmple, here, first_table is the first dataset,
# and second_table is second dataset Like that, "C_first" is from first dataset and "C_second" is from second dataset
# This order must be maintained. It will not work otherwise.


first_table %>%
  inner_join(second_table, by = c("C_first"="C_second"))
#> # A tibble: 2 × 5
#>   A     B.x   C_first B.y   D    
#>   <chr> <chr> <chr>   <chr> <chr>
#> 1 A2    B2    C2      B2    D2   
#> 2 A4    B4    C4      B4    D4

In the result, there is B.x and B.y. They are both B column. As no suffix has been assigned to them, they have taken system based suffixes. However, it is possible to define suffix by the user.

# Here, the operation has taken all the matching records from first_table data and second_table data.
first_table %>%
  inner_join(second_table, by = c("C_first"="C_second"), suffix = c("_first_table", "_second_table"))
#> # A tibble: 2 × 5
#>   A     B_first_table C_first B_second_table D    
#>   <chr> <chr>         <chr>   <chr>          <chr>
#> 1 A2    B2            C2      B2             D2   
#> 2 A4    B4            C4      B4             D4


II. Left Join

The left join returns all the records from first table and the matching records from second table. If there is no match between these datasets, only the records from left dataset will appear.

Left Join

# This operation will take all the records from first_table and only the matching records from second_table.

first_table %>%
  left_join(second_table, by = c("C_first"="C_second"), suffix = c("_first_table", "_second_table"))
#> # A tibble: 3 × 5
#>   A     B_first_table C_first B_second_table D    
#>   <chr> <chr>         <chr>   <chr>          <chr>
#> 1 A2    B2            C2      B2             D2   
#> 2 A3    B3            C3      <NA>           <NA> 
#> 3 A4    B4            C4      B4             D4


III. Right Join

The right join is just the opposite of left join. It returns all the records from second table and only the matching records from first table. Here, if there is no match in record in the first table, only the records from second table will appear in the result.

Right Join

# This operation will take all the records from second_table and only the matching records from first_table.

first_table %>%
  right_join(second_table, by = c("C_first"="C_second"), suffix = c("_first_table", "_second_table"))
#> # A tibble: 4 × 5
#>   A     B_first_table C_first B_second_table D    
#>   <chr> <chr>         <chr>   <chr>          <chr>
#> 1 A2    B2            C2      B2             D2   
#> 2 A4    B4            C4      B4             D4   
#> 3 <NA>  <NA>          C1      B1             D1   
#> 4 <NA>  <NA>          C5      B5             D5


IV. Full Join

A full join merges both the datasets and returns all the rows from both datasets, it is insignificant whether or not they have matching rows.

Full Join

# This operation will take all the records from both first_table and second_table.

first_table %>%
  full_join(second_table, by = c("C_first"="C_second"), suffix = c("_first_table", "_second_table"))
#> # A tibble: 5 × 5
#>   A     B_first_table C_first B_second_table D    
#>   <chr> <chr>         <chr>   <chr>          <chr>
#> 1 A2    B2            C2      B2             D2   
#> 2 A3    B3            C3      <NA>           <NA> 
#> 3 A4    B4            C4      B4             D4   
#> 4 <NA>  <NA>          C1      B1             D1   
#> 5 <NA>  <NA>          C5      B5             D5


Filtering Join

A filtering join is as the name suggests, filters rows while joining. It does not add or remove new variable in the result. It keeps or removes observations from first table. Simply put, it is used to find the matching or non-matching observations between datasets. There are two types of filtering join verb in dplyr package.


I. Semi Join

A semi join returns the common rows between two datasets. If the first table has matching rows with second table, a semi join will return all the matching rows from first table and it will not add any other variable to the dataset.

Semi Join

# This operation will take all the records from first_table that match second_table However, it will not include additional columns in the result. Hence, it does not require the suffix keyword.

first_table %>%
  semi_join(second_table, by = c("C_first"="C_second"))
#> # A tibble: 2 × 3
#>   A     B     C_first
#>   <chr> <chr> <chr>  
#> 1 A2    B2    C2     
#> 2 A4    B4    C4


II. Anti Join

An anti join is just the opposite process of a semi join. It returns the rows that are not matching. If there are observations in the first table that do not match with the second table, using anti join will bring them forward and in addition, it will not add or remove any variable from the first table.

Anti Join

# This operation will take all the records from first_table that do not match with second_table.

first_table %>%
  anti_join(second_table, by = c("C_first"="C_second"))
#> # A tibble: 1 × 3
#>   A     B     C_first
#>   <chr> <chr> <chr>  
#> 1 A3    B3    C3


This discussion above provides the elementary discussion on joining datasets in r. Joining can occur among several tables and, at times, on more than two variables on each table. This vignette aims to provide the intuitions behind the join types and their basic operation procedure. However, there are many options to explore joining data.

References

Harry Cooney.(2019, January 28). What Are Data Joins?.
https://www.thedataschool.co.uk/harry-cooney/what-are-data-joins

Tableau. Join Your Data. Tableau Desktop and Web Authoring Help
https://help.tableau.com/current/pro/desktop/en-us/joining_tables.htm